BRIEFINGS IN BIOINFORMATICS. VOL 13. NO 4. 420-429
Advance Access published on 4 February 2012 



doi:10.1093/bib/bbr080



Measuring the microbiome: 
perspectives on advances in 
DNA-based techniques for exploring 
microbial life 

James A. Foster, John Bunge, Jack A. Gilbert and Jason H. Moore 

Submitted: 23rd September 2011; Received (in revised form): 16th December 2011 

Abstract 

This article reviews recent advances in 'microbiome studies': molecular, statistical and graphical techniques to 
explore and quantify how microbial organisms affect our environments and ourselves given recent advances in
sequencing technology. Microbiome studies are moving beyond mere inventories of specific ecosystems to quantifi- 
cations of community diversity and descriptions of their ecological function. We review the last 24 months of pro- 
gress in this sort of research, and anticipate where the next 2 years will take us. We hope that bioinformaticians 
will find this a helpful springboard for new collaborations with microbiologists. 

INTRODUCTION

We live in a microbial world, with microscopic 
organisms filling discrete ecosystems in such envir- 
onments as soil, lakes and oceans, the human gut or 
skin, and even computer keyboards. Though microbiota
include bacteria, archaea, viruses and microscopic
eukarya, we will consider only bacterial examples
in this article. Bacteria comprise most of the Earth's 
biomass and richness [1]. They dominate ecological 
functions such as carbon cycling, greenhouse gas 
emission and oxygen production. Ninety per cent 
of the cells in a human body are bacterial, as are 
99% of the gene transcripts [2]. However, most of
the microbial world has been inaccessible to us, a 
kind of biological 'dark matter', since we do not 
know how to culture over 97% of all bacteria, and 
since older cultivation-independent microbial survey 
techniques such as TRFLP (Terminal Restriction
Fragment Length Polymorphism), ARISA
(Automated Ribosomal Intergenic Spacer Analysis) and gradient
gel electrophoresis have significant limitations. 'Next
Generation' sequencing technologies have enabled, 
for the first time, high-throughput microbial sam- 
pling [3]. 

Current microbiome studies extract DNA from a 
microbiome sample, quantify how many representa- 
tives of distinct populations (species, ecological func- 
tions or other properties of interest) were observed in 
the sample, and then estimate a model of the original 
community. Ambitious projects are underway to 
catalog microbial life for the entire Earth, the 
ocean and the human body [4-6]. Surveys of transcriptomes
and entire genomes have revealed more
than half of all known protein sequences. Existing 
methods for estimating richness and community 
structure from observed samples are becoming 
more refined, improving model estimation, confidence
quantification and comparative methods
[7-9]. Finally, interactive, visual techniques are
emerging with which to explore these complicated 
data sets prior to formal analysis. 

The new sequencing technologies have idiosyn- 
cratic strengths and weaknesses, which are not fully 
understood, and are beyond the scope of this review 
[10]. Currently, most researchers use the Roche 
454 GS-FLX or Illumina GAIIx/HiSeq2000 sequen- 
cing platforms. The Roche 454 GS-FLX Titanium 
can now generate in excess of 1 million reads per 
run, which takes 23 h, with read lengths up to 
1000 bp (average ~500 bp); the average run generates
750 Mbp of sequencing data. The Illumina
HiSeq2000 platform can now generate ~4 billion
paired-end reads per run (with two flow cells of
1 billion fragments each), which takes 10 days,
with (usually) 150 bp paired-end reads to create an
~250-bp product; the average run generates 1 Tbp
of sequencing data. Of course, there is wide variation 
between individual labs for these statistics. Emerging 
technologies, such as single molecule sequencing and 
smaller single lab devices are not widely used yet, and 
Sanger sequencing of large-insert libraries is still sig- 
nificant [11]. 

Recent bioinformatics advances have significantly
improved the detection and correction of sequencing
and assembly errors. Several packages provide pipelines to
bring these new algorithms into the lab [12, 13]. 
Bioinformaticists continue to improve algorithms 
for detecting specific types of error, such as chimeric 
sequences [14] and precise but inaccurate reads 
[15, 16]. 

In this review, we survey recent advances in 
genome-based analytical techniques to measure the 
diversity of complete microbial communities. There 
are, of course, many other ways for analytical scien- 
tists to advance microbiome studies, which we do 
not review here, such as new quality control meth- 
ods, large-scale data curation, knowledge mining and 
novel data-analytic techniques such as metaproteo- 
mics and advanced mass spectrometry. So, for work- 
ing purposes here we consider a 'microbiome' to be a 
well-defined patch of an ecosystem, such as all bac- 
teria in a prescribed sector of the ocean or all bacteria 
from a specific body part of several humans. We use 
microbial ecology terminology rather than statistical 
conventions, so that a 'population' is a collection of 
all organisms of a given species, a 'community' is a 
collection of 'populations' that share a specific
ecosystem, and a 'sample' or 'specimen' is a physical
extract from a given microbiome. Finally, we limit 
references for the most part to recent publications 
that serve as jumping off points for further explor- 
ation, rather than a complete literature survey. 

In this article, first we discuss studies based on 16S 
rRNA amplicons. Next, we review analyses of meta- 
genomic and metatranscriptomic data from shotgun 
sequencing of multiple genomes or genome tran- 
scripts. We then consider advances and limitations 
in statistical techniques for diversity estimation. 
Then we discuss visual analytics, hypothesis gener- 
ation by visually exploring these very large sequence 
data sets. Finally, we speculate on how microbiome 
studies may change in the next 2 years. 



16S rRNA AMPLICON ANALYSIS

Hypervariable regions of individual, highly con- 
served genes, such as the small ribosomal subunit in 
noneukaryotes, have served as proxies for species 
since Woese and Fox [17-19] first used them to
demonstrate that archaea were a separate kingdom.
With new sequencing technologies based on the 
polymerase chain reaction (PCR) it became possible 
to sample all the 16S rRNA genes in a specimen 
without having to isolate and cultivate organisms in 
order to amplify DNA separately. By tagging speci- 
mens with molecular barcodes, labs can multiplex 
several treatments and controls into a single sequen- 
cing run, making it possible to survey and compare 
different specimens with very few sequencing jobs, 
dramatically shrinking the time between sample 
preparation and data analysis and the sequencing 
costs. 
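
The barcode-based multiplexing described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every barcode has the same length, sits at the start of the read and must match exactly; the function name `demultiplex`, the barcodes and the reads are hypothetical and do not correspond to any particular sequencing platform or pipeline.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign each read to a specimen by its leading barcode and trim the barcode.

    Assumes all barcodes share one length and must match exactly.
    """
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes barcodes of equal length")
    barcode_length = lengths.pop()

    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        # Reads whose barcode is not recognized are kept aside, not discarded.
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "treatment", "TTAG": "control"}
reads = ["ACGTGGCCTA", "TTAGGGCCTT", "ACGTGGCATA", "CCCCGGCCTA"]
binned = demultiplex(reads, barcodes)
```

Production tools typically also tolerate sequencing errors within the barcode, which this sketch deliberately omits.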

The 16S rRNA gene remains a good but far from 
ideal molecular marker for microbial diversity, and 
there is no obvious alternative. 16S rRNA genes 
from hundreds of thousands of organisms have 
been fully sequenced and classified [13, 20]. As 
with all databases, ribosomal databases are growing 
larger and better, so analysis relying on them can 
only improve. The secondary structure of the 16S 
rRNA molecule is well characterized, at least for 
reference strains, which makes it possible to perform 
fast, secondary structure driven alignments [21, 22]. 
However, as with any single gene, the diversity of 
the 16S rRNA gene does not always reflect phylo- 
genetic relationships or metabolic potentials that are 
known from other sources [23]. Current studies 
rarely resolve sequences below the family level 
(even for known strains) due to limited database
depth, though the algorithms themselves are capable
of finer resolution. Consequently, results are often
reported at the order or even phylum level, even
though different species or even strains are likely to 
have very different roles in microbiomes. Database 
sequences are surely biased samples of reality, since 
they assume at least that their targets are amenable to 
existing sequencing and annotation methodologies. 
They have been further biased by a historical fixation 
on potential pathogens and environmental contam- 
inants. However, the 16S rRNA gene is likely to 
remain the most reliable and broadly applicable 
marker for some time. 

To date, only small 16S rRNA gene fragments, 
rather than entire genes or genomes, have been 
amenable to sequencing. Primers exist for hypervariable
regions known as V1 through V9, of widely
varying lengths and phylogenetic resolution [24]. 
Different regions, and combinations of regions, 
have different strengths and weaknesses [25, 26]. 
Historically, human microbiome surveys typically 
sample from regions near V3, while environmental 
surveys often sample from regions near V6, though 
evidence indicates that V2 and V4 are less error
prone, and most projects in the NIH Human
Microbiome Project use the V3-V5 region [27]. As
sequencing technologies and protocols improve, 
projects are sequencing longer regions, such as 
V3-V5 (from the beginning of V3 to the end of 
V5) or V6-V9. Eventually, it may become routine 
to use the entire 16S gene, multiple marker genes, or 
even entire genomes. 

There are two types of algorithms for inferring 
microbiome diversity and structure from 'clean' 
sequences, and both have improved greatly in the 
last 2 years. 

Clustering methods group sequences by similarity, 
computing statistics from the number and size of 
clusters. Clustering methods are sensitive to how 
one measures similarity and what similarity threshold 
one uses [25, 28]. Older distance clustering methods 
begin by comparing all pairs of sequences, producing 
massive distance matrices. Newer algorithms com- 
pute clusters on the fly, requiring far less computer 
memory. Clusters are often called Operational 
Taxonomic Units (OTUs), a term borrowed from 
systematics, though the basis for clustering does not 
always reflect organismal phylogeny or functional 
diversity. Recent studies have shown that, in general, 
average neighbor clustering (usually at a 97%
similarity threshold) following single linkage clustering
(usually at a 98% similarity threshold) works
better for estimating community diversity than alter- 
natives [16]. Very few algorithms exist that rigor- 
ously fit statistical models to sequence data in order 
to estimate microbiome structure (see below). 
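
To make the idea of computing clusters on the fly concrete, the following Python sketch implements a simple greedy, seed-based clustering. It is only an illustration of avoiding an all-pairs distance matrix: it is not the average neighbor or single linkage procedure discussed above, and it assumes that sequences are already aligned to equal length and that similarity is the fraction of identical positions. The names `identity` and `greedy_otus` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Put each sequence in the first cluster whose seed it matches at >= threshold."""
    seeds = []      # one representative sequence per cluster
    clusters = []   # clusters[i] holds the members seeded by seeds[i]
    for seq in sequences:
        for i, seed in enumerate(seeds):
            if identity(seq, seed) >= threshold:
                clusters[i].append(seq)
                break
        else:
            # No existing seed is similar enough, so this sequence starts a cluster.
            seeds.append(seq)
            clusters.append([seq])
    return clusters


# Toy 100-nt sequences, for illustration only.
base = "ACGT" * 25
near = "TT" + base[2:]        # 2 mismatches against base (98% identity)
far = "G" * 10 + base[10:]    # 8 mismatches against base (92% identity)
otus = greedy_otus([base, near, far])
cluster_sizes = [len(cluster) for cluster in otus]
```

Only the seeds are held for comparison, so memory grows with the number of clusters rather than with the square of the number of sequences. The result depends on input order, which is one reason the choice of clustering method matters.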

Classification methods, on the other hand, weight 
their analysis with metadata such as estimates 
of phylogenetic or functional relationships. 
Increasingly sophisticated algorithms, including 
Bayesian inference, match experimental sequences 
to those in existing databases [13, 20, 29], which 
are continually updated [13, 29]. Classification meth- 
ods, including phylogeny-informed analyses [30, 31], 
help with research projects where it is important to 
know more than the diversity of a microbiome; for 
instance the number of organisms likely to be related 
to potential pathogens or the likely functional cap- 
acity of a community. UniFrac algorithms estimate 
between-population (so called 'beta') diversity, 
informed by estimated phylogenetic divergence 
between samples [32]. These techniques will im- 
prove over time with rapidly improving databases 
and phylogenetic estimation algorithms. However, 
they are limited by the very small number of 
sequenced organisms relative to what exists in 
nature, by the computational complexity of current 
phylogenetic estimation algorithms, and by the 
problematic nature of the species concept for bac- 
teria. Moreover, many organisms in the databases are 
still unclassified, having been recalcitrant to current 
taxonomic methods [25, 33]. 
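
As an illustration of a phylogeny-informed comparison, the sketch below computes the unweighted form of UniFrac: the fraction of branch length in a tree that leads to organisms from only one of two samples. The tree encoding (a dictionary mapping each node to its parent and branch length), the function names and the toy tree are assumptions made for this example; they are not the data structures used by any published implementation.

```python
def branches_to_root(tree, leaves):
    """Collect every branch (named by its child node) on the paths from leaves to the root."""
    covered = set()
    for node in leaves:
        # Stop at the root, or as soon as we reach a branch already collected.
        while tree[node][0] is not None and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(tree, sample_a, sample_b):
    """Branch length unique to one sample divided by branch length in either sample."""
    in_a = branches_to_root(tree, sample_a)
    in_b = branches_to_root(tree, sample_b)

    def total_length(nodes):
        return sum(tree[node][1] for node in nodes)

    total = total_length(in_a | in_b)
    return total_length(in_a ^ in_b) / total if total else 0.0


# Hypothetical tree: node -> (parent, length of the branch above the node).
tree = {
    "root": (None, 0.0),
    "X": ("root", 0.5),
    "Y": ("root", 0.5),
    "t1": ("X", 1.0),
    "t2": ("X", 1.0),
    "t3": ("Y", 1.0),
    "t4": ("Y", 1.0),
}
distance = unweighted_unifrac(tree, {"t1", "t2"}, {"t2", "t3"})
```

For this toy tree, 2.5 of the 4.0 units of covered branch length are unique to one sample, so the distance is 0.625. Identical samples give 0 and samples that share no branches give 1.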

METAGENOMICS/ 
METATRANSCRIPTOMICS 

Researchers use metagenomic and metatranscrip- 
tomic sequencing to explore the functional and ex- 
pressed potentials of microbial communities. Most 
studies have performed extensive sequencing of bac- 
terial communities [34]. But viral [35] and eukaryotic 
[36] communities have also been studied. Indeed, 
recent metagenomic data analysis is being used to 
expand the breadth of perceived phylogenetic 
space [37]. 

The difficulty of assembling and annotating the 
data, due to short read lengths, has been the primary 
challenge to analyzing high-throughput
metagenomic/metatranscriptomic data [38]. Assembly is
important for the reconstruction of genes and operons
for functional assignment and improved annotation 
of taxonomy [39], but also for re-assembly of
whole genomes from metagenomic DNA [40]. 
Independently of assembly problems, functional an- 
notation is a difficult problem, compounded by the 
sheer quantity of sequence data. Consequently, auto- 
mated annotation has become routine, with little or 
no manual assessment of accuracy [41]. One of the 
most appropriate ways of assessing the accuracy of
assembly and annotation of metagenomic data is
to use in silico simulated data from fragmented gen- 
omes [42] or actual fragmented genomic DNA from 
known organisms [43]. 
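
A minimal sketch of the in silico approach follows. It draws error-free reads uniformly at random from a known sequence and records where each read came from, so that a downstream assembly or annotation can be scored against the truth. The function `simulate_reads` and the random "genome" are hypothetical; realistic simulators would also model sequencing error, uneven coverage and mixtures of organisms, none of which is attempted here.

```python
import random


def simulate_reads(genome, read_length, n_reads, seed=0):
    """Draw error-free reads uniformly at random from a known genome string.

    Returns (start, read) pairs so that the true origin of every read is known.
    """
    if read_length > len(genome):
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append((start, genome[start:start + read_length]))
    return reads


# A random 1000-nt string stands in for a reference genome (illustration only).
genome_rng = random.Random(42)
genome = "".join(genome_rng.choice("ACGT") for _ in range(1000))
reads = simulate_reads(genome, read_length=100, n_reads=50)
```

Because each read carries its true start position, any position assigned by an assembler or mapper can be checked directly.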

Nonetheless, comparative metagenomics remains 
one of the most powerful ways to explore gene dis- 
tribution across different ecosystems [44]. Several 
tools and technologies exist for comparing functional 
community dynamics across different metagenomic 
data sets [45]. Current techniques are limited by
difficulties in contextualizing sequencing data with
environmental metadata from the target ecosystem
[46]. However, techniques are being developed to 
improve these analyses, once environmental meta- 
data about the niche space in which the community 
was structured becomes available [47]. 

It is possible to model complex community dy- 
namics in relation to the chemical and physical 
dynamics of the ecosystem, even without exhaustive 
sequence and environmental data. For example, tools 
exist to derive the abundance of gene/transcript 
fragments annotated to known enzyme activities 
from metagenomic and metatranscriptomic data 
[48]. In addition, bioclimatic models are being de- 
veloped to extrapolate the responses of bacterial 
community structure to environmental change, and 
how this will affect relative changes in the con- 
sumption or production of metabolites in an 
ecosystem [49]. 

STATISTICS FOR DIVERSITY 
ESTIMATION 

The statistical challenges for microbiome studies are 
to estimate population richness and diversity, model 
community structure, quantify uncertainty and com- 
pare estimates rigorously [50]. This is true whether 
the analysis is based on clustering or classification- 
based methodologies. We divide the relevant pro- 
cedures into two groups: (i) methods that treat the 
observed sample as the community and (ii) methods 
that account for the existence of unobserved 
(unsampled) organisms or taxa in the community. 



The former group is represented by procedures 
such as UniFrac [32]. These methods are extremely 
useful and informative and are well-documented and 
implemented in current software (e.g. mothur, 
QIIME) so we do not address them here. The 
latter group consists of quantitative, inferential stat- 
istical procedures, that is, methods that estimate true 
but unknown numerical measures of diversity (such
as the total taxonomic richness of a community,
both observed and unobserved). These methods are
described mainly in the theoretical statistical litera- 
ture, which bioinformatics specialists are less likely to 
read. So we focus on them in this expository article. 

Most current techniques begin with frequency 
count data, which group observations into bins
and report the number of members of each bin. 
There are two main approaches to richness estima- 
tion from such count data. The classical or frequentist 
approach is better represented both in the literature 
and in available software. Coverage-based nonpara- 
metric estimators like Chao and ACE are popular, 
being simple to compute, and are available in bio- 
informatics packages such as mothur and QIIME 
[12, 51]. But they are known to underestimate the 
true diversity in high-diversity situations, and to
behave erratically when outliers are present [50]. 
Recently, more stable but computationally intensive 
parametric mixture models have been introduced. 
Both types of estimate are available in a single package,
CatchAll [7]. Further, CatchAll computes several
different estimates and returns a ranked
comparison of the 'best' analyses for a given data set. 
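
As a concrete example of a nonparametric estimator computed from frequency count data, the sketch below derives the counts f_j (the number of taxa observed exactly j times) and evaluates the bias-corrected Chao1 estimate, S_obs + f_1(f_1 - 1) / (2(f_2 + 1)). This is a textbook form of the estimator, written here for illustration; it is not the CatchAll, mothur or QIIME implementation, it reports no standard error, and the abundance values are invented.

```python
from collections import Counter


def frequency_counts(abundances):
    """Map j -> f_j, the number of taxa observed exactly j times."""
    return Counter(n for n in abundances if n > 0)


def chao1(abundances):
    """Bias-corrected Chao1 lower-bound estimate of total richness."""
    f = frequency_counts(abundances)
    observed = sum(f.values())            # number of distinct taxa seen
    f1, f2 = f.get(1, 0), f.get(2, 0)     # singletons and doubletons
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Invented per-taxon read counts for one sample.
abundances = [12, 7, 3, 2, 2, 1, 1, 1, 1]
estimate = chao1(abundances)
```

Here nine taxa are observed, with four singletons and two doubletons, so the estimate is 9 + (4 x 3) / (2 x 3) = 11. The estimate depends only on the rarest classes, which is why it is sensitive to unreliable low frequency counts.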

The Bayesian approach, in contrast, begins with a 
prior probability distribution that represents what is 
known or believed about the diversity before col- 
lecting any data. Using Bayes' Theorem, this 
approach then derives a posterior distribution using 
the observed data, which yields the final estimate of 
diversity along with error terms and credible
intervals. There are two ways to define the prior. 
In 'objective' or 'non-informative' Bayesian analysis 
one minimizes the amount of information in the 
prior so that it influences the end result as little as 
possible; while in subjective or informative Bayesian 
analysis the prior expresses the experimenter's beliefs 
about the diversity, or weights the results according 
to known factors that are unrelated to the observed 
data. Both have been studied in the diversity estima- 
tion literature, but the objective Bayesian approach is 
more widely accepted [52, 53]. Indeed it promises to 
be statistically and computationally stable and 
flexible, and may well be a strong competitor to the
frequentist methods. But at present there is no simple 
and generally accessible Bayesian diversity estimation 
software, so we have less applied experience than 
with the classical approach. 

Recently, statistical methods have been developed 
that adjust estimates according to patterns in or as- 
sumptions about the frequency count data. For ex- 
ample, the successive ratios of frequencies (the 
number of doubletons divided by the number of 
singletons, tripletons divided by doubletons, etc.) 
have known statistical properties, which led to a 
new estimation method (available in CatchAll) 
[54]. Another example incorporates suspected unre- 
liability of low frequency counts into diversity esti- 
mates. Recent analyses of artificially constructed 
communities with known diversity and structure in- 
dicate that existing methods may systematically lead 
to inflated low frequency counts. Strategies to ad- 
dress such biases include: (i) using a Bayesian prior 
weighted toward lower diversity values; (ii) reporting 
lower bounds rather than direct estimates for the 
total diversity; (iii) statistically separating the projected
population into low and high-diversity components
and deleting or downweighting the latter;
and (iv) pooling low frequency counts up to
some cutoff (say, the singletons and doubletons) 
and re-estimating the total diversity from these 
left-censored data [55]. All of these strategies are stat- 
istically feasible, although not all have been imple- 
mented in software [CatchAll includes (ii) and (iii)], 
and this remains an area of current research. 
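
The successive ratios of frequencies mentioned above are simple to compute, as the following sketch shows. It only produces the ratios f_(j+1) / f_j; the estimation method built on them [54] is not reproduced here. The function name and the example counts are hypothetical.

```python
def successive_ratios(freq):
    """Return {j: f_(j+1) / f_j} wherever both consecutive counts are non-zero."""
    ratios = {}
    for j in sorted(freq):
        if freq[j] > 0 and freq.get(j + 1, 0) > 0:
            ratios[j] = freq[j + 1] / freq[j]
    return ratios


# Invented frequency counts: 40 singletons, 20 doubletons, 12 tripletons, ...
freq = {1: 40, 2: 20, 3: 12, 5: 3}
ratios = successive_ratios(freq)
```

In this example the ratio of doubletons to singletons is 0.5 and that of tripletons to doubletons is 0.6; no ratio is reported at j = 3 because no taxon was observed exactly four times.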

The next logical step is to move from estimating 
the diversity of a single community ('population' in 
the statistical sense) to comparing diversity levels 
across two or more communities. Given reliable 
richness estimates for individual communities, it is 
straightforward to make statistical comparisons of 
richness between microbiomes. It is considerably 
more challenging to quantify how much population 
structure is shared between two or more commu- 
nities. One common metric for two communities is 
the Jaccard index, which is the ratio of the number of 
shared populations to the total number of popula- 
tions observed. Other between-community diversity 
metrics include Sorensen, Bray-Curtis and 
Morisita-Horn [51]. However, these formulae are 
often used to compare observed samples rather 
than estimated communities, leading to statistically 
indefensible practices such as discarding data to 'normalize'
samples to the same size. What is lacking is
between-community diversity metrics that account
for both observed and unobserved populations. This 
appears to be a challenging statistical problem. Chao 
et al. [8] provided a nonparametric estimator of the 
true, community-level Jaccard and Sorensen indices. 
But few other solutions have been proposed [56].
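
For reference, the sketch below computes three of these indices directly from observed counts. It therefore illustrates exactly the observed-sample versions criticized above, not the community-level estimators of Chao et al. [8], and it makes no correction for unobserved populations. The function names and the two count tables are invented for the example.

```python
def present(counts):
    """Set of taxa with a non-zero count in a sample."""
    return {taxon for taxon, n in counts.items() if n > 0}


def jaccard(a, b):
    """Shared taxa divided by all taxa observed in either sample."""
    union = present(a) | present(b)
    return len(present(a) & present(b)) / len(union) if union else 0.0


def sorensen(a, b):
    """Twice the shared taxa divided by the sum of the two observed richnesses."""
    total = len(present(a)) + len(present(b))
    return 2 * len(present(a) & present(b)) / total if total else 0.0


def bray_curtis_dissimilarity(a, b):
    """Sum of absolute count differences divided by the total count."""
    taxa = set(a) | set(b)
    difference = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return difference / total if total else 0.0


# Invented observed counts for two samples.
sample_a = {"otu1": 50, "otu2": 30, "otu3": 20}
sample_b = {"otu1": 10, "otu3": 40, "otu4": 50}
j = jaccard(sample_a, sample_b)
s = sorensen(sample_a, sample_b)
bc = bray_curtis_dissimilarity(sample_a, sample_b)
```

Two of the four observed taxa are shared, so the Jaccard index is 0.5 and the Sorensen index is 4/6. The Bray-Curtis dissimilarity is 140/200 = 0.7 because it also weighs how unevenly the shared taxa are represented.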

Finally, microbiome studies need to model or pre- 
dict richness and diversity using covariate data, such 
as observable biological, chemical, or other environ- 
mental variables. If the response or dependent vari- 
able is simply the (estimated) richness then standard 
statistical modeling techniques such as regression are 
appropriate. But modeling diversity and structure,
rather than just richness, as a function of the predict- 
ors, requires techniques such as canonical corres- 
pondence analysis [9]. 

All these analyses should be based on estimates 
of unobserved structure, rather than exclusively 
observed data, since substantial unobserved diversity 
is typical of microbial ecology studies. 

VISUALIZING THE RESULTS 

Microbiome data are inherently high dimensional 
and complex. Suppose the goal of a project is to 
relate bacterial community structure at a particular 
body site to clinical observations. A typical data set 
might include a list of hundreds of bacterial species 
that are hierarchically organized into different 
groups, including genera, families, orders, classes 
and phyla. This is further complicated by informa- 
tion about genes and pathways that are present in 
each of the bacterial species and how these relate 
to clinical endpoints. Information about the host, such
as genomic and demographic data, patient specifics
and lifestyle data, may also be important. The ultimate
challenge is to put these many different layers of
information together in a statistical or machine learn- 
ing analysis to identify clinically useful patterns. 

Given this level of data complexity, it is important 
for the researcher to have tools with which to visu- 
alize and explore data. Visual interaction allows the 
researcher to critically explore the measurements 
themselves for quality control, for discovering pat- 
terns that lead to new hypotheses, and for interpret- 
ing results. Also, it is often desirable to communicate 
results visually to other scientists and clinicians. 
However, it is challenging to choose the right visu- 
alization technique for the right type of data or in- 
formation, given that there are so many information 
visualization methods [57]. 

Several different information visualization methods
have been useful for the analysis of microbiome
data. For example, heat maps, introduced >50 years 
ago [58], have become popular and useful for visua- 
lizing population structure in large microbial com- 
munities and for clusters of expression patterns in 
genomics [59]. A heat map consists of a 2D grid or 
matrix of colored squares where each square repre- 
sents an observation of a variable and the color of the 
square is proportional to the value of that observa- 
tion. It is common to order the squares along the 
two axes with additional categorical data such as bac- 
terial phyla and tissue type. For example, a recent 
study by Wu et al. [60] explored the relationship be- 
tween long-term dietary patterns and gut microbial 
enterotypes. This study used Spearman correlations 
to estimate the association between different nutri- 
ents and bacterial genera in 98 healthy volunteers. It 
summarized the results with a heat map, where each 
column represented different taxa, each row repre- 
sented a different nutrient and the color of each 
square represented the magnitude of the correlation, 
with darker red representing stronger positive correl- 
ations and darker blue representing stronger negative 
correlations. Wu et al. also performed a hierarchical 
cluster analysis to organize the results into visual pat- 
terns that were easier to interpret. For example, the 
authors found that fat-related nutrients tended to be 
more similar in the correlations across taxa than other 
nutrient groups. In addition to heat maps, the au- 
thors also used principal components analysis (PCA) 
to identify linear combinations of gut microbial taxa 
associated with long-term diet. They used 2D and 
3D scatterplots to identify clusters of patients defined 
by the first two or three principal components. This 
type of multivariate analysis is inherently visual and 
can prove to be a very useful information visualiza- 
tion tool for microbiome analysis. 
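
The grid of values behind such a correlation heat map can be sketched in a few lines of Python. The example computes Spearman correlations (Pearson correlation of ranks, with tied values sharing their average rank) between each nutrient and each taxon. The data are invented and are not those of Wu et al. [60]; the function names are hypothetical, and coloring and hierarchical clustering of the grid are left to a plotting package.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (raises) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def correlation_grid(nutrients, taxa):
    """Rows are nutrients, columns are taxa; each cell is a Spearman correlation."""
    return {
        nutrient: {taxon: spearman(nv, tv) for taxon, tv in taxa.items()}
        for nutrient, nv in nutrients.items()
    }


# Invented measurements for five subjects.
nutrients = {"fat": [10, 20, 30, 40, 50], "fiber": [5, 3, 8, 1, 9]}
taxa = {"genus_a": [1, 4, 9, 16, 25], "genus_b": [9, 7, 5, 3, 1]}
grid = correlation_grid(nutrients, taxa)
```

In this toy grid, "fat" rises monotonically with "genus_a" and falls monotonically with "genus_b", so those two cells are +1 and -1, the extremes of the color scale described above.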

Some recent projects move beyond visualization 
into visual analytics, which closely integrates compu- 
tational analysis and visualization and human-com- 
puter interaction [61]. This is distinct from 
information visualization, which focuses on methods 
such as heat maps for showing high-dimensional re- 
search results, and scientific visualization, which 
focuses on the mathematics and physics of visualizing 
complex objects. What distinguishes visual analytics 
is the integration of data analysis with visualization 
methods so that data analysis can be launched directly 
from the visualization, and the visualization adjusted 
in response to the data analysis. Computer hardware
such as the Microsoft Surface Computer or the
Apple iPad enables and democratizes visual analytics.
All of this combined with a 3D visualization screen 
or display wall provides a modern visual analytics 
discovery environment that immerses users in their 
data and research results. 

For example, Ravel et al. [62] used movies to ex- 
plore and display the temporal variation in the vagi- 
nal microbiome of 396 women from different racial 
groups, and work is underway to incorporate tem- 
poral and patient metadata. The use of movies allows 
users to interact with the visualization in a way that is 
not possible with static images. As another example, 
one can extend the traditional heat map by integrat- 
ing and rendering additional information along the 
z-axis [63]. This additional visual dimension 
enhances the visual discovery process. In this study, 
the authors implemented the 3D heat map using a 
commercial 3D video game engine called Unity3D.
(The authors chose the Unity3D development tool
because it uses Mono, the open-source, cross-platform
.NET implementation, so as to not be limited
to code libraries supplied by the vendor.) Unity
makes graphical user interface (GUI) code easy to
write, enabling rapid prototyping, and the workflow 
for incorporating assets from other tools such as 
Maya and Photoshop is straightforward. An add- 
itional advantage is that Unity can use Direct3D on 
Windows machines, which allows users to employ 
off-the-shelf drivers to view 3D heat maps in stereo 
on suitable equipment. OpenGL would require ex- 
plicit coding to see the view from each eye to pro- 
duce stereo. The ability to easily see 3D heat maps in 
stereo is important given the widespread and emer- 
ging availability of 3D televisions and computer 
monitors, and leveraging game development systems 
for data analysis engages powerful market forces to 
enhance scientific analyses. 

Another important benefit of using video game 
engines for visual analytics is that they make it pos- 
sible to interact with the 3D visualization as you 
would in a video game. Animation, sound and 
point and click interaction with the data on the 
screen enable the user to experience their data in 
creative ways. The end result is an open-source software
package that combines human-computer interaction
and visualization in a 3D heat map in a way
that is not possible with common analysis tools such 
as Microsoft Excel or R. Figure 1 illustrates the GUI
for the 3D heat map software package. Also illustrated
is one view of the microbiome data from
Moore et al. [63]. The software allows you to load
data from an SQLite database, select color schemes,
select visualization settings and even perform a cluster
analysis as a way to organize the results. Here,
each row represents a different microbe. The height
and side color (green to red) of the bars represent the
relative abundance of each microbe while the columns
represent different patients (colored yellow to
blue) at different time points in chronological order.
A 3D mouse or keyboard controls and a standard
mouse make it possible to interactively explore the
data. A central challenge for adapting these kinds of
visualization tools for microbiome data will be the
integration of phylogenetic information.

Figure 1: Screenshot of the 3D heat map application showing menus for data selection, chart style, viewpoint,
chart view and cluster analysis. Each menu can be minimized or hidden. Illustrated are human microbiome data.
Each row is a microbe with the name shown on the y-axis. Each column is a different subject and time point.
The z-axis represents the relative abundance of the microbes. The 3D heat map makes it possible to add additional
layers of information in the fourth and fifth dimensions, using colors (see online documents).

WHERE THINGS ARE GOING 

Sequencing technologies will continue to improve in 
both accuracy and throughput, and benchtop sequencers
will become standard equipment in
individual labs. Amplicon techniques will rely more
on whole gene samples, perhaps from multiple 
genes, removing the bias associated with selecting 
fixed fragments of a particular gene. This will in- 
crease the need for tools that deduce phylogenies 
from gene genealogies. Complete 16S rRNA gene 
sequences will remain the standard for microbial sys- 
tematics for some time. However, we anticipate that 
amplicon analysis will become a quick screening 
technique, preliminary to more detailed metage- 
nomic studies, rather than the final stage in ecological 
analysis. 

The ideal data set for genomic-based microbial 
studies of any given ecosystem, including those associated
with humans and other animals, is a complete
genome for every organism at a given time in the 
ecosystem. When combined with temporal observa- 
tions, it might be possible to completely characterize 
the genetic diversity of the system by sequencing the 
dominant organisms as the system changes. When 
the temporally situated, approximated genomes of
the dominant members are sequenced, it may be 
possible to generate comprehensive models of mi- 
crobial metabolism and interactions and to design 
experiments that manipulate the system by adding 
or removing specific populations. The most obvious 
route toward such comprehensive data sets is single 
genome isolation and sequencing [64]. This technology
is currently performed by isolating single microbial
cells and sequencing them directly. It is used to
identify the functional potential of organisms and to 
design economically feasible, rather than exhaustive, 
shotgun metagenomics studies. Naturally, it will be 
difficult to sample very low-abundance organisms, or 
to sample deeply enough to detect minor genomic 
variations. Limited coverage is a technological challenge,
which is likely to be overcome by new technologies.
But the problem of sequencing depth may be endemic to
microbiome studies if small genomic variations are 
discovered that significantly alter community 
functions. 

But the ultimate objective of microbiome studies 
is to build complete, predictive models of how 
microbiomes interact and respond to stimuli such 
as climate change, agricultural practices and disease 
[65]. Parameterizing such complex models will con- 
tinue to require metatranscriptomic and other 'omic' 
studies of the expressed capability of community 
members [33, 66, 67]. Using techniques such as au- 
tonomous collection and preservation of microbial 
communities for metatranscriptomic analysis com- 
bined with quantitative characterization of transcrip- 
tion in metatranscriptomic data, we may start to see a 
revolution in our ability to quantify functional cap- 
ability [68, 69]. 

Statistical improvements will occur in parametric 
model estimation, error and uncertainty bounds, and 
in comparing diversity statistics, especially in terms of 
comparison of communities. These improvements 
are likely to include refined techniques for censoring 
unreliable data, without first characterizing where 
the noise comes from. We also anticipate that soft- 
ware tools will become more available for sophisti- 
cated analyses, but that interpreting results will still 
require statistical expertise. 

Information visualization and visual analytics will 
become standard parts of microbiome research 
workflows. Integration into statistical computing 
software such as R is already underway, so that analyses
can be launched directly from visualization applications.
The ability to launch statistical analyses
directly from the visualization environment opens
the door to making discoveries that are inspired by 
visual cues, rather than preconceived hypotheses that 
are dependent on existing knowledge. 